Let us load the dataset from the file
US_Accidents_Dec21_updated.csv into a data frame,
df_us_acc. Because the file is too large to upload to GitHub,
we store it in a folder called
us_accidents_dataset, located one folder above
the project folder.
df_us_acc <- read.csv('../../us_accidents_dataset/US_Accidents_Dec21_updated.csv')
str(df_us_acc)
## 'data.frame': 2845342 obs. of 47 variables:
## $ ID : chr "A-1" "A-2" "A-3" "A-4" ...
## $ Severity : int 3 2 2 2 3 2 2 2 2 2 ...
## $ Start_Time : chr "2016-02-08 00:37:08" "2016-02-08 05:56:20" "2016-02-08 06:15:39" "2016-02-08 06:51:45" ...
## $ End_Time : chr "2016-02-08 06:37:08" "2016-02-08 11:56:20" "2016-02-08 12:15:39" "2016-02-08 12:51:45" ...
## $ Start_Lat : num 40.1 39.9 39.1 41.1 39.2 ...
## $ Start_Lng : num -83.1 -84.1 -84.5 -81.5 -84.5 ...
## $ End_Lat : num 40.1 39.9 39.1 41.1 39.2 ...
## $ End_Lng : num -83 -84 -84.5 -81.5 -84.5 ...
## $ Distance.mi. : num 3.23 0.747 0.055 0.123 0.5 ...
## $ Description : chr "Between Sawmill Rd/Exit 20 and OH-315/Olentangy Riv Rd/Exit 22 - Accident." "At OH-4/OH-235/Exit 41 - Accident." "At I-71/US-50/Exit 1 - Accident." "At Dart Ave/Exit 21 - Accident." ...
## $ Number : num NA NA NA NA NA NA NA NA NA NA ...
## $ Street : chr "Outerbelt E" "I-70 E" "I-75 S" "I-77 N" ...
## $ Side : chr "R" "R" "R" "R" ...
## $ City : chr "Dublin" "Dayton" "Cincinnati" "Akron" ...
## $ County : chr "Franklin" "Montgomery" "Hamilton" "Summit" ...
## $ State : chr "OH" "OH" "OH" "OH" ...
## $ Zipcode : chr "43017" "45424" "45203" "44311" ...
## $ Country : chr "US" "US" "US" "US" ...
## $ Timezone : chr "US/Eastern" "US/Eastern" "US/Eastern" "US/Eastern" ...
## $ Airport_Code : chr "KOSU" "KFFO" "KLUK" "KAKR" ...
## $ Weather_Timestamp : chr "2016-02-08 00:53:00" "2016-02-08 05:58:00" "2016-02-08 05:53:00" "2016-02-08 06:54:00" ...
## $ Temperature.F. : num 42.1 36.9 36 39 37 35.6 33.8 33.1 39 32 ...
## $ Wind_Chill.F. : num 36.1 NA NA NA 29.8 29.2 NA 30 31.8 28.7 ...
## $ Humidity... : num 58 91 97 55 93 100 100 92 70 100 ...
## $ Pressure.in. : num 29.8 29.7 29.7 29.6 29.7 ...
## $ Visibility.mi. : num 10 10 10 10 10 10 3 0.5 10 0.5 ...
## $ Wind_Direction : chr "SW" "Calm" "Calm" "Calm" ...
## $ Wind_Speed.mph. : num 10.4 NA NA NA 10.4 8.1 2.3 3.5 11.5 3.5 ...
## $ Precipitation.in. : num 0 0.02 0.02 NA 0.01 NA NA 0.08 NA 0.05 ...
## $ Weather_Condition : chr "Light Rain" "Light Rain" "Overcast" "Overcast" ...
## $ Amenity : chr "False" "False" "False" "False" ...
## $ Bump : chr "False" "False" "False" "False" ...
## $ Crossing : chr "False" "False" "False" "False" ...
## $ Give_Way : chr "False" "False" "False" "False" ...
## $ Junction : chr "False" "False" "True" "False" ...
## $ No_Exit : chr "False" "False" "False" "False" ...
## $ Railway : chr "False" "False" "False" "False" ...
## $ Roundabout : chr "False" "False" "False" "False" ...
## $ Station : chr "False" "False" "False" "False" ...
## $ Stop : chr "False" "False" "False" "False" ...
## $ Traffic_Calming : chr "False" "False" "False" "False" ...
## $ Traffic_Signal : chr "False" "False" "False" "False" ...
## $ Turning_Loop : chr "False" "False" "False" "False" ...
## $ Sunrise_Sunset : chr "Night" "Night" "Night" "Night" ...
## $ Civil_Twilight : chr "Night" "Night" "Night" "Night" ...
## $ Nautical_Twilight : chr "Night" "Night" "Night" "Day" ...
## $ Astronomical_Twilight: chr "Night" "Night" "Day" "Day" ...
First, let’s check the percentage of NAs in each column of the dataset.
(colMeans(is.na(df_us_acc)))*100
## ID Severity Start_Time
## 0.000000 0.000000 0.000000
## End_Time Start_Lat Start_Lng
## 0.000000 0.000000 0.000000
## End_Lat End_Lng Distance.mi.
## 0.000000 0.000000 0.000000
## Description Number Street
## 0.000000 61.290031 0.000000
## Side City County
## 0.000000 0.000000 0.000000
## State Zipcode Country
## 0.000000 0.000000 0.000000
## Timezone Airport_Code Weather_Timestamp
## 0.000000 0.000000 0.000000
## Temperature.F. Wind_Chill.F. Humidity...
## 2.434646 16.505678 2.568830
## Pressure.in. Visibility.mi. Wind_Direction
## 2.080593 2.479350 0.000000
## Wind_Speed.mph. Precipitation.in. Weather_Condition
## 5.550967 19.310789 0.000000
## Amenity Bump Crossing
## 0.000000 0.000000 0.000000
## Give_Way Junction No_Exit
## 0.000000 0.000000 0.000000
## Railway Roundabout Station
## 0.000000 0.000000 0.000000
## Stop Traffic_Calming Traffic_Signal
## 0.000000 0.000000 0.000000
## Turning_Loop Sunrise_Sunset Civil_Twilight
## 0.000000 0.000000 0.000000
## Nautical_Twilight Astronomical_Twilight
## 0.000000 0.000000
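As a minimal sketch of how this percentage is computed, the example below applies the same `colMeans(is.na(...))` idea to a small toy data frame. The data frame `toy` and its values are hypothetical and only serve to illustrate the calculation.

```r
# Toy data frame (hypothetical values) with some missing entries
toy <- data.frame(
  Severity   = c(2, 3, 2, 4),
  Wind_Chill = c(36.1, NA, NA, 29.8),
  Number     = c(NA, NA, NA, 12)
)

# is.na() returns a logical matrix; colMeans() averages TRUE/FALSE per column,
# which gives the share of missing values in each column
na_pct <- colMeans(is.na(toy)) * 100
print(na_pct)

# Sort so the columns with the most missing values come first
na_pct_sorted <- sort(na_pct, decreasing = TRUE)
print(na_pct_sorted)
```

Sorting the result makes it easier to spot the columns that need attention when there are many variables.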
Here, the highest percentage of NAs is in the column
Number, followed by Precipitation.in.,
Wind_Chill.F., and a few other columns. As we don’t need
the column Number, we drop it. We keep
the rest of the columns because they are part of our analysis. We
also drop the Description column for faster code
execution.
df_us_acc <- subset(df_us_acc, select = -c(Number, Description))
As the remaining columns contain comparatively few NAs, we can simply drop those records.
df_us_acc <- drop_na(df_us_acc)
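One thing worth keeping in mind is that dropping NA rows does not remove blank strings, which can still appear as a category in character columns. The sketch below uses a hypothetical toy data frame and base R's `complete.cases()` to show the difference. Treating blanks as missing is an assumption made here for illustration, not a step taken in this analysis.

```r
# Toy data (hypothetical values): one numeric NA and one blank string
toy <- data.frame(
  Temperature       = c(42.1, NA, 36.0, 39.0),
  Weather_Condition = c("Fair", "Fog", "", "Cloudy"),
  stringsAsFactors  = FALSE
)

# Blank strings are not NA, so NA-based filtering alone would keep them
n_blank <- sum(toy$Weather_Condition == "")

# Assumption for this sketch: treat blanks as missing before filtering
toy$Weather_Condition[toy$Weather_Condition == ""] <- NA

# Keep only the rows without any missing value
kept <- toy[complete.cases(toy), ]
print(kept)
```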
Next, we extract the year from the Start_Time
column to check how the data is distributed across years.
df_us_acc$year<-format(as.Date(df_us_acc$Start_Time, format="%Y-%m-%d"),"%Y")
ggplot(df_us_acc, aes(x = year, fill=year)) +
geom_bar()
As we can see in the yearly distribution graph, the dataset has been updated with multiple data sources. Thus, we decided that the year 2021 will be the optimal subset of the data.
clean_acc21 <- subset(df_us_acc, year==2021)
Let’s extract the month from the Start_Time and check
the monthly distribution.
clean_acc21$Month<-as.numeric(format(as.Date(clean_acc21$Start_Time, format="%Y-%m-%d"),"%m"))
clean_acc21$Hour<-hour(clean_acc21$Start_Time)
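The date parts can also be extracted with base R only. The following is a minimal sketch on hypothetical timestamps that follow the same "YYYY-MM-DD HH:MM:SS" layout as Start_Time. The vector `start_time` and the fixed UTC time zone are assumptions made for the example.

```r
# Hypothetical timestamps in the same layout as Start_Time
start_time <- c("2021-01-15 08:30:00", "2021-11-02 17:45:10")

# Parse once; a fixed time zone keeps the result independent of the machine
parsed <- as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Pull out the individual parts with format()
year  <- format(parsed, "%Y")
month <- as.numeric(format(parsed, "%m"))
hour  <- as.numeric(format(parsed, "%H"))

print(data.frame(start_time, year, month, hour))
```

Parsing the string once and reusing the parsed value avoids converting the same column repeatedly on a large dataset.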
Now, we will check the Severity distribution in the data.
ggplot(df_us_acc, aes(x = Severity, fill = factor(Severity))) +
geom_bar()
As we can see in the graph, the severity levels are imbalanced. Accidents with a severe impact on traffic are far less frequent than less severe ones, which also matches what we would expect in the real world. Thus, we decided to merge levels 1 and 2 into “Not Severe” and levels 3 and 4 into “Severe” to make our analysis more specific.
clean_acc21 <- clean_acc21 %>%
mutate(Is_severe = if_else(Severity == 1 | Severity ==2 , "Not Severe", "Severe"))
clean_acc21$Is_severe <- as.factor(clean_acc21$Is_severe)
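The recoding step can be illustrated on a small vector. This is a base R sketch with hypothetical severity values; it sets the factor levels explicitly so that "Not Severe" is always the reference level.

```r
# Hypothetical severity levels on the original 1-4 scale
severity <- c(1, 2, 3, 4, 2)

# Levels 1 and 2 become "Not Severe"; levels 3 and 4 become "Severe"
is_severe <- factor(
  ifelse(severity %in% c(1, 2), "Not Severe", "Severe"),
  levels = c("Not Severe", "Severe")
)

# Quick check of the resulting class balance
print(table(is_severe))
```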
For some initial EDA, we were curious to see how the data looks on a map, particularly for the DC area, where we currently live. The map below shows the accidents that took place in the DC area in 2021.
df_map<-dplyr::select(clean_acc21, State, Start_Lat, Start_Lng)
df_map_DC <- df_map %>% filter(State == "DC")
df_map_DC_sf <- st_as_sf(df_map_DC, coords = c("Start_Lng", "Start_Lat"), crs = 4326)
mapview(df_map_DC_sf, map.types = "Stamen.Toner",col.regions=("red"))
To answer our first SMART question, “Does weather affect the severity of traffic?”, we first checked the distribution of the numerical weather variables.
tempHist <- ggplot(clean_acc21, aes(x=Temperature.F.)) + geom_histogram(color="black", fill = "red")+
ggtitle("Histogram of Temperature(F) for accidents")
windcHist <- ggplot(clean_acc21, aes(x=Wind_Chill.F.)) + geom_histogram(color="black", fill = "orange")+
ggtitle("Histogram of Wind chill for accidents")
humidHist <- ggplot(clean_acc21, aes(x=Humidity...)) + geom_histogram(color="black", fill = "yellow")+
ggtitle("Histogram of Humidity for accidents")
windsHist <- ggplot(clean_acc21, aes(x=Wind_Speed.mph.)) + geom_histogram(color="black", fill = "navy")+
ggtitle("Histogram of Wind Speed for accidents")
pressHist <- ggplot(clean_acc21, aes(x=Pressure.in.)) + geom_histogram(color="black", fill = "green")+
ggtitle("Histogram of Pressure for accidents")
visibHist <- ggplot(clean_acc21, aes(x=Visibility.mi.)) + geom_histogram(color="black", fill = "blue")+
ggtitle("Histogram of Visibility for accidents")
precipHist <- ggplot(clean_acc21, aes(x=Precipitation.in.)) + geom_histogram(color="black", fill = "purple")+
ggtitle("Histogram of Precipitation for accidents")
grid.arrange(tempHist, windcHist, humidHist, windsHist, pressHist, visibHist, precipHist, ncol=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We created histograms of the number of accidents for each numerical
weather variable. From these histograms, we found that Temperature,
Wind Chill, and Humidity have left-skewed distributions.
The remaining variables, Wind Speed, Pressure, Visibility, and
Precipitation, have fairly close means and medians, with a few
outliers.
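A quick numeric companion to reading skewness off a histogram is to compare the mean and the median. The sketch below uses a small hypothetical vector with a long left tail; in a left-skewed sample the mean is typically pulled below the median.

```r
# Hypothetical values with a long left tail (one unusually low observation)
x <- c(10, 60, 65, 70, 72, 75, 78, 80)

x_mean   <- mean(x)
x_median <- median(x)

# A mean clearly below the median hints at left skew
cat("mean:", x_mean, " median:", x_median, "\n")
left_skew_hint <- x_mean < x_median
```

This is only a rough indicator and does not replace looking at the full distribution.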
wooutlier_winds <- outlierKD2(clean_acc21, Wind_Speed.mph., rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)
## Outliers identified: 24710
## Proportion (%) of outliers: 1.8
## Mean of the outliers: 24.14
## Mean without removing outliers: 7.12
## Mean if we remove outliers: 6.82
## Outliers successfully removed
clean_acc21_woo <- outlierKD2(wooutlier_winds, Pressure.in., rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)
## Outliers identified: 101201
## Proportion (%) of outliers: 7.7
## Mean of the outliers: 26.25
## Mean without removing outliers: 29.41
## Mean if we remove outliers: 29.65
## Outliers successfully removed
We tried removing the outliers from Wind Speed and Pressure, and the generated plots show that, without outliers, both variables are closer to normally distributed than the original data. However, we decided to keep the outliers because they are natural in weather variables when the data covers a whole year. Also, the outliers do not affect the result of the t-test.
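The definition of the outlierKD2 helper is not shown here. As a sketch only, and assuming it follows the common boxplot rule (values more than 1.5 × IQR beyond the quartiles), outlier detection could look like the example below. The vector `wind_speed` is hypothetical.

```r
# Hypothetical wind speeds with one extreme value
wind_speed <- c(3, 5, 6, 7, 7, 8, 9, 10, 45)

# Quartiles and interquartile range
q   <- unname(quantile(wind_speed, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Assumed rule: flag values more than 1.5 * IQR outside the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- wind_speed < lower | wind_speed > upper

# Compare the mean with and without the flagged values
mean_all   <- mean(wind_speed)
mean_clean <- mean(wind_speed[!is_outlier])
cat("outliers:", sum(is_outlier),
    " mean with:", round(mean_all, 2),
    " mean without:", round(mean_clean, 2), "\n")
```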
plot1 <- ggplot(clean_acc21, aes(x = Is_severe, y=Temperature.F.)) +
geom_boxplot() +
labs(title="Temperature by Severity", x="Severity", y = "Temperature(F)")
plot2 <- ggplot(clean_acc21, aes(x = Is_severe, y=Wind_Chill.F.)) +
geom_boxplot() +
labs(title="Wind Chill by Severity", x="Severity", y = "Wind Chill")
plot3 <- ggplot(clean_acc21, aes(x = Is_severe, y=Wind_Speed.mph.)) +
geom_boxplot() +
labs(title="Wind Speed by Severity", x="Severity", y = "Wind Speed")
plot4 <- ggplot(clean_acc21, aes(x = Is_severe, y=Humidity...)) +
geom_boxplot() +
labs(title="Humidity by Severity", x="Severity", y = "Humidity")
plot5 <- ggplot(clean_acc21, aes(x = Is_severe, y=Pressure.in.)) +
geom_boxplot() +
labs(title="Pressure by Severity", x="Severity", y = "Pressure")
plot6 <- ggplot(clean_acc21, aes(x = Is_severe, y=Visibility.mi.)) +
geom_boxplot() +
labs(title="Visibility by Severity", x="Severity", y = "Visibility")
plot7 <- ggplot(clean_acc21, aes(x = Is_severe, y=Precipitation.in.)) +
geom_boxplot() +
labs(title="Precipitation by Severity", x="Severity", y = "Precipitation")
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, ncol=3)
We looked at the distribution of the weather variables for the two
severity levels, ‘Severe’ and ‘Not Severe’.
For Temperature, Wind Chill, and Humidity, we can see differences in the range of the data and in the outliers between severity levels.
The remaining variables, Wind Speed, Pressure, Visibility, and Precipitation, still do not span a wide range, but the boxplots make it easier to compare their distributions between the two severity levels.
box_clean2_severe = subset(clean_acc21, Is_severe == 'Severe')
box_clean2_notsevere = subset(clean_acc21, Is_severe == 'Not Severe')
print("Temperature.F")
## [1] "Temperature.F"
t.test(box_clean2_severe$Temperature.F., box_clean2_notsevere$Temperature.F.)
##
## Welch Two Sample t-test
##
## data: box_clean2_severe$Temperature.F. and box_clean2_notsevere$Temperature.F.
## t = -43.428, df = 22832, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.875845 -5.368348
## sample estimates:
## mean of x mean of y
## 57.73754 63.35964
print("Wind_Chill.F.")
## [1] "Wind_Chill.F."
t.test(box_clean2_severe$Wind_Chill.F., box_clean2_notsevere$Wind_Chill.F.)
##
## Welch Two Sample t-test
##
## data: box_clean2_severe$Wind_Chill.F. and box_clean2_notsevere$Wind_Chill.F.
## t = -42.876, df = 22802, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.518654 -5.948717
## sample estimates:
## mean of x mean of y
## 56.11613 62.34982
print("Humidity...")
## [1] "Humidity..."
t.test(box_clean2_severe$Humidity..., box_clean2_notsevere$Humidity...)
##
## Welch Two Sample t-test
##
## data: box_clean2_severe$Humidity... and box_clean2_notsevere$Humidity...
## t = 17.706, df = 22858, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.380009 2.972538
## sample estimates:
## mean of x mean of y
## 67.31599 64.63971
print("Wind_Speed.mph.")
## [1] "Wind_Speed.mph."
t.test(box_clean2_severe$Wind_Speed.mph., box_clean2_notsevere$Wind_Speed.mph.)
##
## Welch Two Sample t-test
##
## data: box_clean2_severe$Wind_Speed.mph. and box_clean2_notsevere$Wind_Speed.mph.
## t = -8.0218, df = 22802, p-value = 1.092e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3806752 -0.2311737
## sample estimates:
## mean of x mean of y
## 6.815942 7.121866
print("Visibility.mi.")
## [1] "Visibility.mi."
t.test(box_clean2_severe$Visibility.mi., box_clean2_notsevere$Visibility.mi.)
##
## Welch Two Sample t-test
##
## data: box_clean2_severe$Visibility.mi. and box_clean2_notsevere$Visibility.mi.
## t = -1.1754, df = 22823, p-value = 0.2398
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05474086 0.01369841
## sample estimates:
## mean of x mean of y
## 9.053624 9.074146
print("Pressure.in.")
## [1] "Pressure.in."
t.test(box_clean2_severe$Pressure.in., box_clean2_notsevere$Pressure.in.)
##
## Welch Two Sample t-test
##
## data: box_clean2_severe$Pressure.in. and box_clean2_notsevere$Pressure.in.
## t = -30.44, df = 22554, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3006229 -0.2642501
## sample estimates:
## mean of x mean of y
## 29.12821 29.41065
print("Precipitation.in.")
## [1] "Precipitation.in."
t.test(box_clean2_severe$Precipitation.in., box_clean2_notsevere$Precipitation.in.)
##
## Welch Two Sample t-test
##
## data: box_clean2_severe$Precipitation.in. and box_clean2_notsevere$Precipitation.in.
## t = -2.8336, df = 23267, p-value = 0.004606
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0014187715 -0.0002585427
## sample estimates:
## mean of x mean of y
## 0.004886261 0.005724918
First, we split the data into two subsets by severity to check whether the means of the weather variables are the same for the two severity levels. We then performed a two-sample t-test for each weather variable, since the numerical weather variables are quantitative and we have two samples based on the severity levels.
H0: The means of Temperature/Wind Chill/Humidity/Wind Speed/Pressure/Visibility/Precipitation are the same for the different severity levels.
H1: The means of Temperature/Wind Chill/Humidity/Wind Speed/Pressure/Visibility/Precipitation are NOT the same for the different severity levels.
The p-values from all tests except the one for Visibility are lower than 0.05, so we can reject H0 for every weather variable but Visibility. This means that the means of all weather variables except Visibility differ by severity level.
From these t-tests, we conclude that numerical weather variables such as Temperature, Wind Chill, Humidity, Pressure, Wind Speed, and Precipitation affect the severity of traffic.
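The repeated t-test calls can be expressed more compactly by looping over the variable names. The following sketch runs Welch two-sample t-tests on simulated data. The data frame `toy`, its two columns, and the Bonferroni adjustment at the end are assumptions added for illustration and are not part of the analysis above.

```r
set.seed(1)

# Simulated data (hypothetical): 50 observations per severity group
toy <- data.frame(
  Is_severe   = rep(c("Severe", "Not Severe"), each = 50),
  Temperature = c(rnorm(50, mean = 58, sd = 10), rnorm(50, mean = 63, sd = 10)),
  Visibility  = rnorm(100, mean = 9, sd = 1)
)

# Split once by severity level
severe     <- toy[toy$Is_severe == "Severe", ]
not_severe <- toy[toy$Is_severe == "Not Severe", ]

# Run a Welch two-sample t-test per variable and collect the p-values
vars <- c("Temperature", "Visibility")
p_values <- sapply(vars, function(v) {
  t.test(severe[[v]], not_severe[[v]])$p.value
})
print(p_values)

# Optional: adjust for running several tests at once
p_adjusted <- p.adjust(p_values, method = "bonferroni")
print(p_adjusted)
```

Collecting the p-values in one named vector makes it straightforward to compare variables side by side.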
However, numerical weather variables are not the only weather-related variables. Our dataset also contains categorical weather variables, such as Wind Direction and Weather Condition.
# wind direction with severity
ggplot(clean_acc21, aes(Wind_Direction, ..prop.., group = Is_severe)) +
geom_bar(aes(fill = Is_severe)) +
#scale_y_continuous(labels = percent) +
labs(x = "Wind Direction",
y = "Proportion",
title = "Wind direction by Severity") +
theme(text = element_text(size=8))
We made a bar plot to see the distribution of severity by wind
direction. The bar plot shows a similar distribution for both
severity levels across most wind directions. We therefore infer
that wind direction does not have much effect on
the severity of traffic, and we decided not to perform any statistical
analysis on the Wind Direction variable.
# Weather Condition percentage
unique(clean_acc21$Weather_Condition)
## [1] "Fair" "Fog"
## [3] "Mostly Cloudy" "Cloudy"
## [5] "Partly Cloudy" ""
## [7] "Light Rain" "Heavy T-Storm / Windy"
## [9] "Light Snow" "Rain"
## [11] "T-Storm" "Haze"
## [13] "Fair / Windy" "Smoke"
## [15] "Cloudy / Windy" "Snow"
## [17] "Heavy Snow" "Thunder"
## [19] "Thunder in the Vicinity" "N/A Precipitation"
## [21] "Heavy Rain" "Thunder / Windy"
## [23] "Heavy T-Storm" "Mostly Cloudy / Windy"
## [25] "Shallow Fog" "Mist"
## [27] "Partly Cloudy / Windy" "Snow / Windy"
## [29] "Light Rain with Thunder" "Light Snow / Windy"
## [31] "Rain / Windy" "Wintry Mix"
## [33] "Heavy Rain / Windy" "Drizzle"
## [35] "Light Drizzle" "Light Rain / Windy"
## [37] "Haze / Windy" "Light Snow and Sleet"
## [39] "Showers in the Vicinity" "T-Storm / Windy"
## [41] "Patches of Fog" "Light Freezing Rain"
## [43] "Sand / Dust Whirlwinds" "Light Freezing Drizzle"
## [45] "Fog / Windy" "Heavy Drizzle"
## [47] "Light Snow with Thunder" "Blowing Dust / Windy"
## [49] "Rain Shower" "Heavy Snow / Windy"
## [51] "Blowing Snow / Windy" "Light Rain Shower"
## [53] "Snow and Sleet" "Drizzle and Fog"
## [55] "Light Sleet" "Drizzle / Windy"
## [57] "Light Snow Shower" "Snow and Thunder / Windy"
## [59] "Light Sleet / Windy" "Smoke / Windy"
## [61] "Blowing Dust" "Wintry Mix / Windy"
## [63] "Blowing Snow" "Widespread Dust / Windy"
## [65] "Light Drizzle / Windy" "Squalls"
## [67] "Tornado" "Squalls / Windy"
## [69] "Hail" "Blowing Snow Nearby"
## [71] "Partial Fog" "Widespread Dust"
## [73] "Sand / Windy" "Thunder / Wintry Mix"
## [75] "Light Freezing Rain / Windy" "Light Snow and Sleet / Windy"
## [77] "Heavy Rain Shower / Windy" "Small Hail"
## [79] "Sand / Dust Whirlwinds / Windy" "Light Rain Shower / Windy"
## [81] "Thunder and Hail" "Freezing Rain"
## [83] "Heavy Sleet" "Snow Grains"
## [85] "Sleet" "Freezing Drizzle"
## [87] "Snow and Sleet / Windy" "Freezing Rain / Windy"
## [89] "Heavy Freezing Drizzle" "Heavy Freezing Rain"
## [91] "Blowing Sand"
WC <- clean_acc21 %>%
group_by(Weather_Condition) %>%
summarise(cnt = n()) %>%
mutate(freq = (round(cnt/sum(cnt), 3))*100 )%>%
arrange(desc(freq)) %>%
filter(freq > 1)
WC
## # A tibble: 8 × 3
## Weather_Condition cnt freq
## <chr> <int> <dbl>
## 1 Fair 686869 48.3
## 2 Cloudy 207687 14.6
## 3 Mostly Cloudy 186185 13.1
## 4 Partly Cloudy 124698 8.8
## 5 Light Rain 61960 4.4
## 6 Fog 24213 1.7
## 7 Light Snow 23197 1.6
## 8 Haze 19755 1.4
WC %>%
ggplot() +
geom_col(mapping = aes(x=reorder(Weather_Condition, -freq), y=freq, fill = Weather_Condition)) +
labs(x = "Weather Condition", y="%", title ="Top 8 Weather Conditions with accidents") +
theme(text = element_text(size=7))
Next, we took a look at the Weather Condition variable.
There are many weather conditions in our dataset, so we made
a bar plot of the top 8 weather conditions under which accidents happened. The
major weather conditions when car accidents happened were ‘Fair’,
‘Cloudy’, ‘Mostly Cloudy’, ‘Partly Cloudy’, ‘Light Rain’, ‘Fog’, ‘Light
Snow’, and ‘Haze’.
The most frequent weather condition was ‘Fair’. However, because the Weather Condition variable is split into detailed conditions, as with ‘Cloudy’, ‘Mostly Cloudy’, and ‘Partly Cloudy’, we decided to run a chi-squared test on all weather conditions and severity.
# Try the Chi-Squared Test on all Weather_Conditions and Severity
WCtable <- table(clean_acc21$Weather_Condition , clean_acc21$Severity)
xkabledply(WCtable, title = "Severity by Weather Conditions")
| Weather Condition | 2 | 4 |
|---|---|---|
| (blank) | 3616 | 84 |
| Blowing Dust | 68 | 0 |
| Blowing Dust / Windy | 67 | 5 |
| Blowing Sand | 1 | 0 |
| Blowing Snow | 20 | 0 |
| Blowing Snow / Windy | 29 | 0 |
| Blowing Snow Nearby | 2 | 0 |
| Cloudy | 203765 | 3922 |
| Cloudy / Windy | 3837 | 68 |
| Drizzle | 742 | 23 |
| Drizzle / Windy | 5 | 0 |
| Drizzle and Fog | 93 | 2 |
| Fair | 676490 | 10379 |
| Fair / Windy | 8755 | 277 |
| Fog | 23772 | 441 |
| Fog / Windy | 169 | 3 |
| Freezing Drizzle | 10 | 0 |
| Freezing Rain | 25 | 1 |
| Freezing Rain / Windy | 1 | 0 |
| Hail | 2 | 0 |
| Haze | 19623 | 132 |
| Haze / Windy | 315 | 19 |
| Heavy Drizzle | 60 | 0 |
| Heavy Freezing Drizzle | 1 | 0 |
| Heavy Freezing Rain | 1 | 0 |
| Heavy Rain | 6051 | 88 |
| Heavy Rain / Windy | 477 | 3 |
| Heavy Rain Shower / Windy | 1 | 0 |
| Heavy Sleet | 13 | 0 |
| Heavy Snow | 634 | 26 |
| Heavy Snow / Windy | 101 | 5 |
| Heavy T-Storm | 2883 | 37 |
| Heavy T-Storm / Windy | 310 | 4 |
| Light Drizzle | 3108 | 69 |
| Light Drizzle / Windy | 45 | 3 |
| Light Freezing Drizzle | 155 | 4 |
| Light Freezing Rain | 193 | 11 |
| Light Freezing Rain / Windy | 20 | 4 |
| Light Rain | 60990 | 970 |
| Light Rain / Windy | 1779 | 34 |
| Light Rain Shower | 56 | 0 |
| Light Rain Shower / Windy | 1 | 0 |
| Light Rain with Thunder | 4139 | 57 |
| Light Sleet | 55 | 2 |
| Light Sleet / Windy | 5 | 0 |
| Light Snow | 22689 | 508 |
| Light Snow / Windy | 1172 | 27 |
| Light Snow and Sleet | 40 | 0 |
| Light Snow and Sleet / Windy | 13 | 0 |
| Light Snow Shower | 24 | 0 |
| Light Snow with Thunder | 7 | 0 |
| Mist | 306 | 4 |
| Mostly Cloudy | 183719 | 2466 |
| Mostly Cloudy / Windy | 3155 | 53 |
| N/A Precipitation | 646 | 29 |
| Partial Fog | 1 | 0 |
| Partly Cloudy | 122913 | 1785 |
| Partly Cloudy / Windy | 1974 | 30 |
| Patches of Fog | 480 | 6 |
| Rain | 14270 | 196 |
| Rain / Windy | 656 | 7 |
| Rain Shower | 11 | 0 |
| Sand / Dust Whirlwinds | 8 | 0 |
| Sand / Dust Whirlwinds / Windy | 1 | 0 |
| Sand / Windy | 2 | 0 |
| Shallow Fog | 512 | 5 |
| Showers in the Vicinity | 314 | 5 |
| Sleet | 31 | 1 |
| Small Hail | 22 | 1 |
| Smoke | 3977 | 16 |
| Smoke / Windy | 19 | 0 |
| Snow | 2516 | 64 |
| Snow / Windy | 195 | 4 |
| Snow and Sleet | 85 | 7 |
| Snow and Sleet / Windy | 7 | 0 |
| Snow and Thunder / Windy | 2 | 0 |
| Snow Grains | 5 | 0 |
| Squalls | 5 | 0 |
| Squalls / Windy | 5 | 2 |
| T-Storm | 4989 | 61 |
| T-Storm / Windy | 194 | 0 |
| Thunder | 4958 | 36 |
| Thunder / Windy | 178 | 1 |
| Thunder / Wintry Mix | 5 | 1 |
| Thunder and Hail | 1 | 0 |
| Thunder in the Vicinity | 5572 | 76 |
| Tornado | 7 | 0 |
| Widespread Dust | 8 | 0 |
| Widespread Dust / Windy | 18 | 0 |
| Wintry Mix | 2721 | 92 |
| Wintry Mix / Windy | 47 | 0 |
chitest = chisq.test(WCtable)
## Warning in chisq.test(WCtable): Chi-squared approximation may be incorrect
chitest
##
## Pearson's Chi-squared test
##
## data: WCtable
## X-squared = 984.98, df = 90, p-value < 2.2e-16
To identify whether the severity of traffic depends on Weather Condition, we ran a chi-squared test. Weather Condition is a categorical variable, and we kept Severity in its original form, whose numeric levels act as categories, so a chi-squared test can show the dependency between weather conditions and severity.
H0: Severity and weather conditions are independent.
H1: Severity and weather conditions are NOT independent.
As we can see here, the p-value from the chi-squared test is lower than
0.05 for the Weather Conditions variable, so we can reject H0. This
means that the severity of traffic and weather conditions are dependent.
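The warning printed by chisq.test usually appears when some expected cell counts are small, which is likely here because many weather conditions occur only a handful of times. As a hedged sketch of one possible way to handle this, the example below lumps rare rows into an "Other" category and uses a simulated p-value. The counts, the threshold of 5, and the names `tab` and `lumped` are hypothetical.

```r
set.seed(123)

# Hypothetical contingency table: weather condition by severity level
tab <- matrix(
  c(676, 10,
    203,  4,
     23,  1,
      2,  0,
      1,  0),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Weather  = c("Fair", "Cloudy", "Fog", "Hail", "Tornado"),
    Severity = c("2", "4")
  )
)

# Lump rows with fewer than 5 observations in total into "Other"
rare   <- rowSums(tab) < 5
lumped <- rbind(tab[!rare, , drop = FALSE],
                Other = colSums(tab[rare, , drop = FALSE]))
print(lumped)

# Monte Carlo p-value, which does not rely on the large-sample approximation
res <- chisq.test(lumped, simulate.p.value = TRUE, B = 2000)
print(res$p.value)
```

The choice of threshold and whether lumping is appropriate depend on the question being asked, so this should be read as an option rather than a correction.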
From this analysis of numerical and categorical weather variables, we can answer our SMART question about the impact of weather on the severity of traffic. We concluded that the numerical weather variables, except for Visibility, affect the severity of traffic. However, wind direction does not affect the severity of traffic much, because severity shows no clear differences across directions. For the Weather Condition variable, we observe that the weather conditions at the time of an accident affect the severity of traffic.
SMART Question 2: Do Nearby Road Elements affect the severity of traffic?
To determine whether nearby road components have an impact on the severity of the traffic, we have conducted the exploratory data analysis listed below.
clean_acc21 %>%
group_by(Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop)%>%
summarise(percentage = n()/nrow(clean_acc21) *100 )%>%
arrange(-percentage) %>%
filter(percentage > 1) -> accidents_per_roadelement
## `summarise()` has grouped output by 'Amenity', 'Bump', 'Crossing', 'Give_Way',
## 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
## 'Traffic_Calming', 'Traffic_Signal'. You can override using the `.groups`
## argument.
head(accidents_per_roadelement)
## # A tibble: 6 × 14
## # Groups: Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway,
## # Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal [6]
## Amenity Bump Crossing Give_Way Junction No_Exit Railway Round…¹ Station Stop
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 False False False False False False False False False False
## 2 False False False False True False False False False False
## 3 False False False False False False False False False False
## 4 False False True False False False False False False False
## 5 False False True False False False False False False False
## 6 False False False False False False False False True False
## # … with 4 more variables: Traffic_Calming <chr>, Traffic_Signal <chr>,
## # Turning_Loop <chr>, percentage <dbl>, and abbreviated variable name
## # ¹Roundabout
Here, we’ve compiled a list of incidents where surrounding road features like an amenity, a bump, a crossing, a give-way, a junction, a no-exit, a railroad, a roundabout, a station, a stop, a traffic signal, or a turning loop contributed to the accident.
acc_road_element_per <- tibble(c("None", "Junction","Crossing", "Traffic signal", "Crossing and traffic signal", "Station", "Stop" ), pull(accidents_per_roadelement, percentage), .name_repair = ~ c("road_elements", "percentage"))
acc_road_element_per
## # A tibble: 7 × 2
## road_elements percentage
## <chr> <dbl>
## 1 None 77.2
## 2 Junction 6.53
## 3 Crossing 4.11
## 4 Traffic signal 2.94
## 5 Crossing and traffic signal 2.46
## 6 Station 1.54
## 7 Stop 1.38
The tibble above lists the combinations of nearby road elements that each account for more than 1% of the accidents.
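A complementary view is the share of accidents near each individual road element, regardless of which other elements are present. The sketch below computes this on a hypothetical toy data frame whose columns use the same "True"/"False" strings as the dataset.

```r
# Toy data (hypothetical values) using the same "True"/"False" strings
toy <- data.frame(
  Junction = c("True",  "False", "False", "False"),
  Crossing = c("False", "True",  "True",  "False"),
  Stop     = c("False", "False", "False", "False"),
  stringsAsFactors = FALSE
)

# Comparing the data frame to "True" gives a logical matrix;
# colMeans() then yields the share of TRUE values per column
pct <- colMeans(toy == "True") * 100
print(sort(pct, decreasing = TRUE))
```

Unlike the grouped summary, these shares can add up to more than 100% because one accident can be near several elements.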
The graph above shows the accidents that happened as a result of
surrounding road features.
accidents_per_roadelement<-clean_acc21 %>%
select(Is_severe,Junction,Crossing,Stop,Station,Traffic_Signal)
head(accidents_per_roadelement)
## Is_severe Junction Crossing Stop Station Traffic_Signal
## 8893 Not Severe False False False False False
## 8894 Not Severe False False False False False
## 8895 Not Severe False False False True False
## 8896 Not Severe False False False False False
## 8897 Not Severe False False False False False
## 8898 Not Severe False False False False False
Here, we’ve created a data frame with columns for junction, crossing, stop, station, and traffic signal, as well as a severity rating.
Junction_data<- subset(accidents_per_roadelement, Junction=="True",
select=c(Is_severe,Junction))
head(Junction_data)
## Is_severe Junction
## 8907 Not Severe True
## 8913 Not Severe True
## 8923 Not Severe True
## 8931 Not Severe True
## 8934 Not Severe True
## 8988 Not Severe True
Crossing_data<- subset(accidents_per_roadelement, Crossing=="True",
select=c(Is_severe,Crossing))
head(Crossing_data)
## Is_severe Crossing
## 8939 Not Severe True
## 8940 Not Severe True
## 8983 Not Severe True
## 8989 Not Severe True
## 8992 Not Severe True
## 8996 Not Severe True
Stop_data<- subset(accidents_per_roadelement, Stop=="True",
select=c(Is_severe,Stop))
head(Stop_data)
## Is_severe Stop
## 8939 Not Severe True
## 8964 Not Severe True
## 9016 Not Severe True
## 9123 Not Severe True
## 9133 Not Severe True
## 9135 Not Severe True
Station_data<- subset(accidents_per_roadelement, Station=="True",
select=c(Is_severe,Station))
head(Station_data)
## Is_severe Station
## 8895 Not Severe True
## 8922 Not Severe True
## 8939 Not Severe True
## 8959 Not Severe True
## 8974 Not Severe True
## 8985 Not Severe True
Traffic_Signal_data<- subset(accidents_per_roadelement, Traffic_Signal=="True",
select=c(Is_severe,Traffic_Signal))
head(Traffic_Signal_data)
## Is_severe Traffic_Signal
## 8930 Not Severe True
## 8940 Not Severe True
## 8952 Not Severe True
## 8970 Not Severe True
## 8983 Not Severe True
## 8995 Not Severe True
Here, the data frame is split by nearby road element, keeping the accident’s severity level alongside each element.
ggplot(data=Junction_data, aes(x=Is_severe, y=Junction, fill=Is_severe)) +
geom_bar(stat="identity")+
labs(title="Accidents By Junction", x="Severity", y = "Junction")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(data=Crossing_data, aes(x=Is_severe, y=Crossing, fill=Is_severe)) +
geom_bar(stat="identity")+
labs(title="Accidents By Crossing", x="Severity", y = "Crossing")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(data=Stop_data, aes(x=Is_severe, y=Stop, fill=Is_severe)) +
geom_bar(stat="identity")+
labs(title="Accidents By Stop", x="Severity", y = "Stop")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(data=Station_data, aes(x=Is_severe, y=Station, fill=Is_severe)) +
geom_bar(stat="identity")+
labs(title="Accidents By Station", x="Severity", y = "Station")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(data=Traffic_Signal_data, aes(x=Is_severe, y=Traffic_Signal, fill=Is_severe)) +
geom_bar(stat="identity")+
labs(title="Accidents By Traffic Signal", x="Severity", y = "Traffic Signal")+
theme(plot.title = element_text(hjust = 0.5))
The graphs above show the severity of accidents near road elements such as junctions, crossings, stops, stations, and traffic signals.
# Encode severity and each road element as 0/1 indicators
accidents_per_roadelement <- accidents_per_roadelement %>%
  mutate(severe_num = if_else(Is_severe == "Severe", 1, 0),
         Crossing_num = if_else(Crossing == "True", 1, 0),
         Stop_num = if_else(Stop == "True", 1, 0),
         Station_num = if_else(Station == "True", 1, 0),
         Traffic_Signal_num = if_else(Traffic_Signal == "True", 1, 0),
         Junction_num = if_else(Junction == "True", 1, 0))
In order to run the ANOVA test, we convert the categorical data shown here into numerical data.
Junction_anova = aov(Junction_num~severe_num, data=accidents_per_roadelement)
summary(Junction_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## severe_num 1 18 18.066 288.4 <2e-16 ***
## Residuals 1423119 89156 0.063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Crossing_anova = aov(Crossing_num~severe_num, data=accidents_per_roadelement)
summary(Crossing_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## severe_num 1 12 12.333 174.5 <2e-16 ***
## Residuals 1423119 100567 0.071
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Stop_anova = aov(Stop_num~severe_num, data=accidents_per_roadelement)
summary(Stop_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## severe_num 1 1 0.6189 30.45 3.42e-08 ***
## Residuals 1423119 28921 0.0203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Station_anova = aov(Station_num~severe_num, data=accidents_per_roadelement)
summary(Station_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## severe_num 1 8 7.844 276.9 <2e-16 ***
## Residuals 1423119 40314 0.028
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Traffic_signal_anova = aov(Traffic_Signal_num~severe_num, data=accidents_per_roadelement)
summary(Traffic_signal_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## severe_num 1 1 0.5787 8.023 0.00462 **
## Residuals 1423119 102642 0.0721
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the outcome variable is quantitative and the predictor variable is categorical, we use an ANOVA test.
H0: The mean severity of traffic is the same across nearby road elements (Junction, Crossing, Stop, Station, Traffic Signal).
H1: The mean severity of traffic is not the same across nearby road elements (Junction, Crossing, Stop, Station, Traffic Signal).
The p-value from the ANOVA test is below 0.05 for every nearby road element, so we can reject H0.
→ The nearby road elements Junction, Crossing, Stop, Station and Traffic Signal affect the severity of traffic.
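With more than 1.4 million observations, even very small differences produce small p-values, so it can help to look at an effect size as well. The sketch below computes eta squared (model sum of squares divided by total sum of squares) from the values printed in the ANOVA tables above. This is an added illustration, not part of the original analysis. Because each model has one degree of freedom, the printed Mean Sq equals the Sum Sq and is used for the model term; the residual sums of squares are rounded as printed, so the results are approximate.

```r
# Sum-of-squares values copied from the summary() output above (rounded as printed).
anova_ss <- data.frame(
  element  = c("Junction", "Crossing", "Stop", "Station", "Traffic_Signal"),
  ss_model = c(18.066, 12.333, 0.6189, 7.844, 0.5787),  # Mean Sq == Sum Sq because Df = 1
  ss_resid = c(89156, 100567, 28921, 40314, 102642)
)

# Eta squared: share of the total variance accounted for by the severity flag.
anova_ss$eta_sq <- anova_ss$ss_model / (anova_ss$ss_model + anova_ss$ss_resid)

print(anova_ss)
```

All five values come out below 0.001, which suggests the associations are statistically significant but small in size.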
Does the occurrence of the accident during a particular time of day or year affect the severity of the accident?
First, let’s look at some graphs to get a better idea.
The graph above shows the frequency of accidents in 2021 by month. The number of accidents increases toward the end of the year.
The graph above shows the frequency of accidents in 2021 by hour of the day. The frequency spikes from the afternoon to the evening, probably because these are the peak traffic hours.
We ran a chi-squared test to check whether Severity and Hour of day are independent. We chose the chi-squared test because both variables are categorical.
H0: Severity and Hour are independent.
H1: Severity and Hour are NOT independent.
test_Hour <- chisq.test(table(clean_acc21$Severity, clean_acc21$Hour))
test_Hour
##
## Pearson's Chi-squared test
##
## data: table(clean_acc21$Severity, clean_acc21$Hour)
## X-squared = 1641.7, df = 23, p-value < 2.2e-16
We can reject H0 because the p-value is less than 0.05.
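The df = 23 reported above is consistent with a 2 × 24 table (two severity levels by 24 hours), since the degrees of freedom are (rows - 1) × (columns - 1). The following sketch reproduces that setup on simulated data and shows how to inspect the expected counts and standardized residuals that `chisq.test` returns. The data, the night-time effect, and the label "Not severe" are assumptions made only for illustration.

```r
set.seed(1)
n <- 5000

# Simulated hour of day for each accident (assumed uniform over 0..23).
hour <- sample(0:23, n, replace = TRUE)

# Assumed effect for illustration only: severe accidents slightly more likely at night.
p_severe <- ifelse(hour < 6 | hour >= 22, 0.15, 0.08)
severity <- ifelse(runif(n) < p_severe, "Severe", "Not severe")

# 2 x 24 contingency table and Pearson chi-squared test of independence.
tab <- table(severity, hour)
res <- chisq.test(tab)

res$parameter                     # degrees of freedom: (2 - 1) * (24 - 1)
min(res$expected)                 # common rule of thumb: expected counts of at least 5
round(res$stdres["Severe", ], 1)  # hours with large |residual| drive the test statistic
```

Looking at the standardized residuals shows which hours depart most from independence, which the overall p-value alone does not reveal.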
We ran a chi-squared test to check whether Severity and Month are independent. We chose the chi-squared test because both variables are categorical.
H0: Severity and Month are independent.
H1: Severity and Month are NOT independent.
test_Month <- chisq.test(table(clean_acc21$Severity, clean_acc21$Month))
test_Month
##
## Pearson's Chi-squared test
##
## data: table(clean_acc21$Severity, clean_acc21$Month)
## X-squared = 443.62, df = 11, p-value < 2.2e-16
We can reject H0 because the p-value is less than 0.05.
We computed the correlation coefficient to gauge the strength of the relationship between Severity and Month.
cor(clean_acc21$Severity, clean_acc21$Month)
## [1] -0.007085495
Since the value is so close to zero, we can conclude that the relationship is weak.
We computed the correlation coefficient to gauge the strength of the relationship between Severity and Hour.
cor(clean_acc21$Severity, clean_acc21$Hour)
## [1] -0.002004478
Since the value is so close to zero, we can conclude that the relationship is weak.
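One caveat: `cor` treats Hour and Month as linear numeric scales, while both are cyclic, so a value near zero only rules out a linear trend. A possible complement, not used in the original analysis, is Cramér's V, which is derived from the chi-squared statistic and ranges from 0 to 1. The sketch below defines a hypothetical helper `cramers_v` and applies it to simulated data in which severe accidents are assumed to cluster in the late evening and early morning.

```r
# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), from 0 (none) to 1 (perfect association).
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

# Simulated stand-in for the accident data (assumed values, not the real dataset).
set.seed(42)
n <- 10000
hour <- sample(0:23, n, replace = TRUE)

# Assumed pattern: severe share is higher in hours 0-3 and 20-23, symmetric around midday.
p_severe   <- ifelse(hour <= 3 | hour >= 20, 0.30, 0.10)
severe_num <- as.integer(runif(n) < p_severe)

r <- cor(severe_num, hour)        # linear correlation: close to zero by construction
v <- cramers_v(severe_num, hour)  # association measure that ignores the ordering of hours

print(c(pearson_r = r, cramers_v = v))
```

In this simulated case the linear correlation is near zero even though severity clearly depends on the hour, which is why the two measures can be worth reporting side by side.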
As mentioned earlier, our dataset represents a real-world scenario, and few accidents affect traffic severely. In the small number of cases where they do, the following factors play a role.
Does weather affect the severity of traffic?
• Temperature/Wind Chill/Wind Speed/Humidity/Pressure/Precipitation affect the severity of traffic.
• Wind Direction does not affect the severity of traffic much.
• Weather Conditions affect the severity of traffic.
Do Nearby Road Elements affect the severity of traffic?
• The nearby road elements Junction, Crossing, Stop, Station and Traffic Signal affect the severity of traffic.
Does the occurrence of the accident during a particular time of day or year affect the severity of the accident?
• Both Hour and Month affect the severity, but the relationship is weak.